[TRTLLM-8920][feat] decouple disagg service from fastapi #8714

reasonsolo · 2025-10-28T07:51:35Z

Split openai_disagg_server.py into different modules:

fastapi related code remains in openai_disagg_server.py
request dispatching goes to openai_disagg_service.py
sending http request to ctx/gen goes to openai_client.py
perf metrics goes to perf_metrics.py, also add prometheus metrics for disagg-serving

Summary by CodeRabbit

Release Notes

New Features
- Added OpenAI-compatible HTTP client with streaming support and automatic retry logic for improved reliability.
- Introduced disaggregated serving with separate context and generation phases for optimized performance.
- Added performance metrics collection and health monitoring capabilities.
- Enabled real-time worker event callbacks for dynamic cluster management.
Refactor
- Streamlined server architecture with improved lifecycle management and cleaner service patterns.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-11-03T11:03:19Z

📝 Walkthrough

Walkthrough

The changes introduce a service-oriented architecture for OpenAI disaggregated serving. New abstractions include OpenAIService base class, OpenAIClient interface with HTTP implementation, OpenAIDisaggregatedService for orchestrating context/generation flows, metrics collection infrastructure, and response hooks. Supporting utilities and routing are updated to integrate these components.

Changes

Cohort / File(s)	Summary
Type Definitions & Protocols `tensorrt_llm/serve/openai_protocol.py`, `tensorrt_llm/serve/openai_service.py`, `tensorrt_llm/serve/responses_utils.py`	Added type aliases UCompletionRequest/UCompletionResponse; introduced abstract OpenAIService base with lifecycle methods; added ResponseHooks abstract interface for request lifecycle callbacks; added CompletionResponseGenerator type alias and done_generator utility; enhanced ServerArrivalTimeMiddleware to populate ctx_server and gen_server in HTTP scope.
HTTP Client Abstraction `tensorrt_llm/serve/openai_client.py`	Introduced abstract OpenAIClient interface with send_request, _send_request, collect_metrics, check_ready, shutdown, and _finish_request methods; implemented OpenAIHttpClient with aiohttp session management, request retry logic with backoff, streaming response handling via _response_generator, per-token latency metrics, completion callbacks through hooks, and integration with Router for server selection.
Performance Metrics Collection `tensorrt_llm/serve/perf_metrics.py`	Added ClientMetricsCollector for per-client counters and histograms (total, error, retry, completed, latency); added DisaggPerfMetricsCollector for aggregating per-request metrics across clients, maintaining bounded request queue, matching generation/context metrics, and deferred processing of unfinished requests.
Disaggregated Service Implementation `tensorrt_llm/serve/openai_disagg_service.py`	Introduced OpenAIDisaggregatedService (subclass of OpenAIService) orchestrating context and generation requests with conditional disaggregation, request normalization, server readiness waiting, worker-event handling, and response validation; added OpenAIDisaggregatedPreAllocService variant with parallelized context/generation execution via TaskGroup.
Server Integration & Routing `tensorrt_llm/serve/openai_disagg_server.py`, `tensorrt_llm/llmapi/disagg_utils.py`	Refactored server to use service-oriented approach with OpenAIDisaggregatedService, replaced in-process routing with RawRequestResponseHooks for metrics collection; exposed /prometheus/metrics and /perf_metrics endpoints; added steady clock offset alignment; added helper functions for router creation and client initialization; simplified entry points via _wrap_entry_point; renamed get_ctx_gen_server_urls to get_ctx_gen_server_addrs and removed "http://" prefix from address construction.
Worker Management `tensorrt_llm/serve/disagg_auto_scaling.py`	Extended DisaggClusterManager.watch_workers to accept optional on_event callback; added background task (_watch_task) to drain and forward subsequent events to callback; processes existing workers on first run; returns empty list if watch already initialized.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server as OpenAI Server
    participant Service as OpenAIDisaggregatedService
    participant CtxRouter as Context Router
    participant GenRouter as Generation Router
    participant Metrics as PerfMetricsCollector

    Client->>Server: POST /v1/chat/completions
    Server->>Server: _wrap_entry_point (create hooks)
    Server->>Service: openai_chat_completion(request, hooks)
    Service->>Service: _check_conditional_disagg()
    
    rect rgb(200, 220, 255)
    Note over Service: Disaggregated Path
    Service->>CtxRouter: send_request(ctx_request)
    CtxRouter-->>Service: context_response
    Service->>Metrics: on_ctx_resp
    end
    
    Service->>Service: _need_gen(ctx_response)
    
    rect rgb(220, 200, 255)
    Note over Service: Generation Path
    Service->>GenRouter: send_request(gen_request)
    GenRouter-->>Service: generation_response (streaming)
    Service->>Metrics: on_first_token
    end
    
    Service-->>Server: CompletionResponseGenerator
    rect rgb(200, 255, 220)
    Note over Server: Stream Handling
    loop per token
        Server->>Client: token chunk
        Server->>Metrics: update latency
    end
    end
    
    Server->>Metrics: on_resp_done
    Metrics-->>Client: response complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

openai_client.py: Dense async logic with retry mechanisms, streaming response handling, and metrics integration requires careful examination of error paths and state management
openai_disagg_service.py: Complex orchestration logic for context/generation request coordination, conditional disaggregation paths, and worker event handling needs thorough flow validation
perf_metrics.py: Thread-safe metric aggregation with bounded queues and deferred processing logic; edge cases around unfinished requests require close inspection
openai_disagg_server.py: Large-scale refactoring from in-process to service-oriented; clock offset synchronization and metrics exposure pathways need validation
disagg_auto_scaling.py: Background task management and callback invocation semantics should be verified for correctness
Interaction between components: Verify hook invocation sequencing, metric correlation across ctx/gen servers, and error handling propagation through service layers

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	PR description lacks concrete details about the refactoring. While the PR objectives mention decoupling disagg service from fastapi and splitting modules, the PR description provided only contains template guidance without explaining the motivation, specific changes, or test coverage.	Add a clear Description section explaining why this refactoring is needed and what benefits it provides. Add a Test Coverage section listing relevant tests that validate the split modules and decoupled service. Fill out the PR Checklist to confirm compliance with coding guidelines and testing requirements.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title '[TRTLLM-8920][feat] decouple disagg service from fastapi' clearly and concisely summarizes the main objective of the changes. It follows the required format with a JIRA ticket identifier, feature type, and a descriptive summary that accurately reflects the architectural refactoring shown in the raw summary—specifically the decoupling of the disaggregation service from FastAPI dependencies. The title is specific enough that reviewers can understand the primary change at a glance.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/serve/disagg_auto_scaling.py (1)
101-123: Do not overwrite the initialized watch handle

After seeding existing workers, Line 121 reassigns _watch_handle with a second watch(...) call. That overwrites the first handle, so the events you just injected into the queue are lost and downstream consumers miss the pre-existing workers. Please keep the first handle and remove the redundant watch.
-        self._watch_handle = await self._cluster_storage.watch(
-            self.worker_key_prefix)
-
-        async def on_event_wrapper():
+        async def on_event_wrapper():

🧹 Nitpick comments (1)

tensorrt_llm/serve/openai_service.py (1)
18-22: Clarify the openai_completion return contract

Line 18 claims implementations should return a tuple, but the signature on Line 17 returns Union[CompletionResponse, CompletionResponseGenerator]. This contradiction is confusing for implementers and reviewers. Please align the docstring with the declared type (or update the type) so the contract is unambiguous.

Apply this diff to the docstring:
@@
-        """
-        Return a tuple of (completion response, async completion response generator)
-        When request is streaming, the generator will be used to stream the response.
-        When request is not streaming, the generator will be ignore and the response will be returned directly.
-        """
+        """Return either a CompletionResponse or a CompletionResponseGenerator.
+
+        Implementations should yield the generator when `request.stream` is true
+        and otherwise return the complete response object directly.
+        """

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8303cfa and 8ea9d89.

📒 Files selected for processing (9)

tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/serve/disagg_auto_scaling.py (4 hunks)
tensorrt_llm/serve/openai_client.py (1 hunks)
tensorrt_llm/serve/openai_disagg_server.py (3 hunks)
tensorrt_llm/serve/openai_disagg_service.py (1 hunks)
tensorrt_llm/serve/openai_protocol.py (1 hunks)
tensorrt_llm/serve/openai_service.py (1 hunks)
tensorrt_llm/serve/perf_metrics.py (1 hunks)
tensorrt_llm/serve/responses_utils.py (3 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/serve/openai_service.py
tensorrt_llm/serve/disagg_auto_scaling.py
tensorrt_llm/serve/openai_disagg_service.py
tensorrt_llm/serve/perf_metrics.py
tensorrt_llm/serve/openai_client.py
tensorrt_llm/serve/openai_disagg_server.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/serve/openai_service.py
tensorrt_llm/serve/disagg_auto_scaling.py
tensorrt_llm/serve/openai_disagg_service.py
tensorrt_llm/serve/perf_metrics.py
tensorrt_llm/serve/openai_client.py
tensorrt_llm/serve/openai_disagg_server.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/serve/openai_service.py
tensorrt_llm/serve/disagg_auto_scaling.py
tensorrt_llm/serve/openai_disagg_service.py
tensorrt_llm/serve/perf_metrics.py
tensorrt_llm/serve/openai_client.py
tensorrt_llm/serve/openai_disagg_server.py

🧬 Code graph analysis (7)

tensorrt_llm/serve/responses_utils.py (1)

tensorrt_llm/serve/openai_disagg_server.py (4)

on_req_begin (47-48)

on_ctx_resp (50-51)

on_first_token (53-55)

on_resp_done (57-59)

tensorrt_llm/serve/openai_service.py (3)

tensorrt_llm/serve/openai_protocol.py (4)

ChatCompletionRequest (500-717)

ChatCompletionResponse (441-450)

CompletionRequest (210-341)

CompletionResponse (146-155)

tensorrt_llm/serve/openai_disagg_service.py (5)

openai_completion (61-77)

openai_chat_completion (79-84)

is_ready (183-186)

setup (196-217)

teardown (219-228)

tensorrt_llm/serve/disagg_auto_scaling.py (1)

is_ready (224-225)

tensorrt_llm/serve/disagg_auto_scaling.py (2)

tensorrt_llm/serve/cluster_storage.py (10)

WatchEventType (26-28)

watch (94-95)

watch (239-248)

watch (388-390)

watch (533-546)

drain (43-51)

unwatch (98-99)

unwatch (250-257)

unwatch (392-394)

unwatch (548-551)

tensorrt_llm/logger.py (2)

error (126-127)

warning (132-133)

tensorrt_llm/serve/openai_disagg_service.py (8)

tensorrt_llm/llmapi/disagg_utils.py (4)

ConditionalDisaggConfig (42-43)

DisaggClusterConfig (59-64)

DisaggServerConfig (68-78)

ServerRole (19-22)

tensorrt_llm/serve/cluster_storage.py (2)

ClusterStorage (60-106)

WatchEventType (26-28)

tensorrt_llm/serve/disagg_auto_scaling.py (9)

DisaggClusterManager (32-229)

WorkerInfo (17-21)

is_ready (224-225)

cluster_info (65-82)

start (58-59)

watch_workers (96-143)

stop (61-63)

worker_info (265-269)

worker_id (261-262)

tensorrt_llm/serve/openai_client.py (4)

OpenAIClient (28-70)

send_request (29-41)

check_ready (61-63)

check_ready (236-249)

tensorrt_llm/serve/openai_protocol.py (3)

ChatCompletionRequest (500-717)

CompletionRequest (210-341)

DisaggregatedParams (104-109)

tensorrt_llm/serve/openai_service.py (6)

OpenAIService (13-39)

openai_completion (15-23)

is_ready (33-33)

openai_chat_completion (26-30)

setup (36-36)

teardown (39-39)

tensorrt_llm/serve/responses_utils.py (3)

ResponseHooks (894-919)

done_generator (922-923)

on_req_begin (900-901)

tensorrt_llm/serve/router.py (5)

KvCacheAwareRouter (541-647)

Router (146-410)

start_server_monitoring (204-216)

stop_server_monitoring (218-231)

remove_server (183-194)

tensorrt_llm/serve/perf_metrics.py (1)

tensorrt_llm/serve/openai_client.py (2)

collect_metrics (58-58)

collect_metrics (223-231)

tensorrt_llm/serve/openai_client.py (4)

tensorrt_llm/serve/openai_protocol.py (4)

ChatCompletionRequest (500-717)

ChatCompletionResponse (441-450)

CompletionRequest (210-341)

CompletionResponse (146-155)

tensorrt_llm/serve/perf_metrics.py (5)

ClientMetricsCollector (42-60)

inc (56-57)

inc (150-151)

observe (59-60)

observe (153-154)

tensorrt_llm/serve/responses_utils.py (5)

ResponseHooks (894-919)

get_steady_clock_now_in_seconds (86-87)

on_ctx_resp (904-905)

on_first_token (908-912)

on_resp_done (915-919)

tensorrt_llm/serve/router.py (1)

Router (146-410)

tensorrt_llm/serve/openai_disagg_server.py (8)

tensorrt_llm/executor/executor.py (1)

CppExecutorError (60-68)

tensorrt_llm/llmapi/disagg_utils.py (3)

DisaggServerConfig (68-78)

MetadataServerConfig (82-87)

get_ctx_gen_server_addrs (90-101)

tensorrt_llm/serve/cluster_storage.py (4)

HttpClusterStorageServer (142-296)

create_cluster_storage (109-114)

client (464-465)

add_routes (158-167)

tensorrt_llm/serve/openai_client.py (2)

OpenAIClient (28-70)

OpenAIHttpClient (73-249)

tensorrt_llm/serve/openai_disagg_service.py (8)

OpenAIDisaggregatedService (34-271)

disagg_cluster_config (189-190)

setup (196-217)

teardown (219-228)

openai_completion (61-77)

openai_chat_completion (79-84)

cluster_info (177-181)

is_ready (183-186)

tensorrt_llm/serve/responses_utils.py (6)

ResponseHooks (894-919)

get_steady_clock_now_in_seconds (86-87)

on_req_begin (900-901)

on_ctx_resp (904-905)

on_first_token (908-912)

on_resp_done (915-919)

tensorrt_llm/serve/perf_metrics.py (4)

DisaggPerfMetricsCollector (63-154)

add_per_request_metrics (74-91)

add_client (71-72)

get_perf_metrics (93-148)

tensorrt_llm/serve/router.py (2)

Router (146-410)

create_router (650-685)

🪛 Ruff (0.14.2)

tensorrt_llm/serve/disagg_auto_scaling.py

136-136: Do not catch blind exception: Exception

(BLE001)

tensorrt_llm/serve/openai_disagg_service.py

40-40: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

65-65: Avoid specifying long messages outside the exception class

(TRY003)

73-75: Avoid specifying long messages outside the exception class

(TRY003)

83-83: Avoid specifying long messages outside the exception class

(TRY003)

249-249: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

249-249: Avoid specifying long messages outside the exception class

(TRY003)

264-266: Avoid specifying long messages outside the exception class

(TRY003)

268-268: Avoid specifying long messages outside the exception class

(TRY003)

270-270: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/serve/openai_client.py

41-41: Prefer TypeError exception for invalid type

(TRY004)

41-41: Avoid specifying long messages outside the exception class

(TRY003)

65-65: OpenAIClient.shutdown is an empty method in an abstract base class, but has no abstract decorator

(B027)

80-80: Unused method argument: perf_metrics_collector

(ARG002)

139-141: Abstract raise to an inner function

(TRY301)

139-141: Avoid specifying long messages outside the exception class

(TRY003)

229-230: try-except-continue detected, consider logging the exception

(S112)

229-229: Do not catch blind exception: Exception

(BLE001)

241-241: Do not catch blind exception: Exception

(BLE001)

247-247: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

248-248: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/serve/openai_disagg_server.py

17-17: Redefinition of unused CppExecutorError from line 16

(F811)

47-47: Unused method argument: request

(ARG002)

50-50: Unused method argument: response

(ARG002)

53-53: Unused method argument: request

(ARG002)

53-53: Unused method argument: response

(ARG002)

57-57: Unused method argument: response

(ARG002)

59-59: Store a reference to the return value of asyncio.create_task

(RUF006)

93-93: Do not catch blind exception: Exception

(BLE001)

98-98: Unused function argument: app

(ARG001)

141-141: Do not catch blind exception: Exception

(BLE001)

154-154: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

tensorrt_llm/serve/responses_utils.py (1)

894-919: Hooks abstraction looks solid

The lifecycle interface gives us the right touchpoints for the disagg instrumentation work. Nice addition.

tensorrt_llm/serve/openai_client.py

tensorrt_llm/serve/openai_disagg_server.py

tensorrt_llm/serve/openai_disagg_service.py

tensorrt_llm/serve/perf_metrics.py

coderabbitai · 2025-11-03T11:14:27Z

📝 Walkthrough

Walkthrough

This PR introduces a disaggregated OpenAI serving architecture with service-oriented design. It adds HTTP client abstractions, performance metrics collection, event-driven worker management, and new service orchestration modules to coordinate context and generation server workflows.

Changes

Cohort / File(s)	Summary
Core API and Utilities `tensorrt_llm/llmapi/disagg_utils.py`	Renamed `get_ctx_gen_server_urls()` to `get_ctx_gen_server_addrs()` and changed address format from `"http://host:port"` to `"host:port"`.
Worker Management `tensorrt_llm/serve/disagg_auto_scaling.py`	Added optional `on_event` callback support to `watch_workers()` for real-time worker lifecycle events; introduced `_watch_task` attribute and guard check in `get_worker_events()`.
Protocol and Base Types `tensorrt_llm/serve/openai_protocol.py`, `tensorrt_llm/serve/openai_service.py`, `tensorrt_llm/serve/responses_utils.py`	Added type aliases `UCompletionRequest` and `UCompletionResponse`; introduced abstract `OpenAIService` interface with completion/chat-completion and lifecycle methods; added `ResponseHooks` abstract interface with four lifecycle callbacks, `done_generator()` function, and `CompletionResponseGenerator` type alias. Extended ASGI middleware to populate `ctx_server` and `gen_server` scope fields.
HTTP Client `tensorrt_llm/serve/openai_client.py`	Introduced abstract `OpenAIClient` base class and concrete `OpenAIHttpClient` implementation with retry logic, streaming/non-streaming response handling, metrics integration, and health/metrics endpoints.
Performance Metrics `tensorrt_llm/serve/perf_metrics.py`	Added `ClientMetricsCollector` and `DisaggPerfMetricsCollector` classes for per-request and aggregated metrics tracking across context/generation servers with async-safe coordination.
Service Orchestration `tensorrt_llm/serve/openai_disagg_service.py`	New module with `OpenAIDisaggregatedService` and `OpenAIDisaggregatedPreAllocService` providing core disaggregated request dispatch, conditional server routing, readiness checks, and worker event handling.
Server Implementation `tensorrt_llm/serve/openai_disagg_server.py`	Major refactor replacing monolithic logic with streamlined `OpenAIDisaggServer` using service factory pattern; introduced `RawRequestResponseHooks` for per-request metrics; refactored route registration, health/version endpoints, and clock offset logic with new Prometheus metrics integration.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server as OpenAIDisaggServer
    participant Service as OpenAIDisaggregatedService
    participant CtxClient as OpenAIHttpClient<br/>(Context)
    participant GenClient as OpenAIHttpClient<br/>(Generation)
    participant Metrics as DisaggPerfMetricsCollector
    participant Router as Router

    Client->>Server: POST /v1/completions
    Note over Server: Wrap with RawRequestResponseHooks
    Server->>Service: openai_completion(request)
    
    Service->>Service: Check readiness
    Service->>Router: Get available context server
    
    rect rgb(200, 220, 255)
        Note over CtxClient: Context Phase
        Service->>CtxClient: send_request(ctx_server, request)
        CtxClient->>CtxClient: POST with retry logic
        CtxClient->>Metrics: Track metrics
        CtxClient-->>Service: Context response + disagg_params
    end
    
    Service->>Service: _check_conditional_disagg()
    Service->>Router: Select generation server
    
    rect rgb(220, 200, 255)
        Note over GenClient: Generation Phase
        Service->>GenClient: send_request(gen_server, request)
        GenClient->>GenClient: Stream response
        GenClient->>Metrics: Track per-token metrics
        GenClient-->>Service: Streaming events
    end
    
    Service-->>Server: CompletionResponseGenerator
    Server-->>Client: Streaming response with hooks

    Metrics->>Metrics: Aggregate metrics by request_id

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

New module structure: Three new service modules (openai_disagg_service.py, openai_client.py, perf_metrics.py) introduce substantial new logic requiring separate reasoning for each component.
Architectural refactor: openai_disagg_server.py undergoes significant restructuring from monolithic to service-oriented design; public API surface changes require careful validation against existing callers.
Event-driven callbacks: disagg_auto_scaling.py introduces async callback chains with task management; concurrent event handling requires careful review.
Metrics aggregation complexity: DisaggPerfMetricsCollector coordinates async state across multiple clients with per-request key management and capacity-bounded deques; potential race conditions and resource leaks warrant scrutiny.
HTTP client retry and streaming logic: OpenAIHttpClient._post_with_retry() and _response_generator() handle streaming responses with metrics hooks at multiple points; edge cases in response parsing and hook invocation need validation.
Breaking public API changes: Function rename in disagg_utils.py and class constructor signature changes in openai_disagg_server.py require impact analysis.

Areas requiring extra attention:

Concurrent task management in disagg_auto_scaling.py::watch_workers() with _watch_task lifecycle
Async coordination and lock usage patterns in DisaggPerfMetricsCollector to prevent deadlocks
Streaming response handling edge cases in OpenAIHttpClient._response_generator() (partial frames, encoding issues)
Router/metadata server initialization sequence in openai_disagg_server.py::setup()
Conditional disaggregation logic in OpenAIDisaggregatedService._check_conditional_disagg()
Metrics hook invocation points and their impact on request latency

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is largely incomplete and does not follow the repository's PR template effectively. While the template sections for Description and Test Coverage are present, they contain only comments with no substantive content provided by the author. The author merely states 'Still working on a hang issue with streaming requests, but open for review to speed up the review process,' which is insufficient to explain the changes. The PR Checklist is mostly empty (only one checkbox marked), and the description does not clearly explain what changes were made, why they were made, or what test coverage exists for the extensive modifications across multiple new modules and significant refactoring.	The author should complete the PR description by: (1) providing a clear, detailed explanation of the disagg service decoupling from FastAPI, including architectural changes and benefits; (2) listing the relevant test cases that validate the new abstractions and implementations; (3) completing the PR checklist items to confirm adherence to coding guidelines, test coverage, and documentation updates; and (4) removing or clarifying the mention of an unresolved 'hang issue' to indicate readiness for review.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title '[TRTLLM-8920][feat] decouple disagg service from fastapi' is clear, specific, and directly relates to the primary change across the files. It accurately reflects the main objective: decoupling the disaggregated service architecture from FastAPI by introducing abstract service interfaces (OpenAIService), abstract client interfaces (OpenAIClient), and new service implementations (OpenAIDisaggregatedService, OpenAIHttpClient) that enable framework-independent request handling. The title follows the repository's naming convention with a JIRA ticket and feature type tag.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 17

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8303cfa and 8ea9d89.

📒 Files selected for processing (9)

tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/serve/disagg_auto_scaling.py (4 hunks)
tensorrt_llm/serve/openai_client.py (1 hunks)
tensorrt_llm/serve/openai_disagg_server.py (3 hunks)
tensorrt_llm/serve/openai_disagg_service.py (1 hunks)
tensorrt_llm/serve/openai_protocol.py (1 hunks)
tensorrt_llm/serve/openai_service.py (1 hunks)
tensorrt_llm/serve/perf_metrics.py (1 hunks)
tensorrt_llm/serve/responses_utils.py (3 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/openai_service.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_disagg_server.py
tensorrt_llm/serve/perf_metrics.py
tensorrt_llm/serve/openai_disagg_service.py
tensorrt_llm/serve/openai_client.py
tensorrt_llm/serve/disagg_auto_scaling.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/openai_service.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_disagg_server.py
tensorrt_llm/serve/perf_metrics.py
tensorrt_llm/serve/openai_disagg_service.py
tensorrt_llm/serve/openai_client.py
tensorrt_llm/serve/disagg_auto_scaling.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/openai_service.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_disagg_server.py
tensorrt_llm/serve/perf_metrics.py
tensorrt_llm/serve/openai_disagg_service.py
tensorrt_llm/serve/openai_client.py
tensorrt_llm/serve/disagg_auto_scaling.py

🧬 Code graph analysis (7)

tensorrt_llm/serve/openai_service.py (3)

tensorrt_llm/serve/openai_protocol.py (4)

ChatCompletionRequest (500-717)

ChatCompletionResponse (441-450)

CompletionRequest (210-341)

CompletionResponse (146-155)

tensorrt_llm/serve/openai_disagg_service.py (5)

openai_completion (61-77)

openai_chat_completion (79-84)

is_ready (183-186)

setup (196-217)

teardown (219-228)

tensorrt_llm/serve/disagg_auto_scaling.py (1)

is_ready (224-225)

tensorrt_llm/serve/responses_utils.py (2)

tensorrt_llm/serve/openai_protocol.py (1)

ResponsesResponse (846-911)

tensorrt_llm/serve/openai_disagg_server.py (4)

on_req_begin (47-48)

on_ctx_resp (50-51)

on_first_token (53-55)

on_resp_done (57-59)

tensorrt_llm/serve/openai_disagg_server.py (8)

tensorrt_llm/executor/executor.py (1)

CppExecutorError (60-68)

tensorrt_llm/llmapi/disagg_utils.py (3)

DisaggServerConfig (68-78)

MetadataServerConfig (82-87)

get_ctx_gen_server_addrs (90-101)

tensorrt_llm/serve/cluster_storage.py (4)

HttpClusterStorageServer (142-296)

create_cluster_storage (109-114)

client (464-465)

add_routes (158-167)

tensorrt_llm/serve/openai_client.py (2)

OpenAIClient (28-70)

OpenAIHttpClient (73-249)

tensorrt_llm/serve/openai_disagg_service.py (7)

OpenAIDisaggregatedService (34-271)

disagg_cluster_config (189-190)

setup (196-217)

teardown (219-228)

openai_completion (61-77)

cluster_info (177-181)

is_ready (183-186)

tensorrt_llm/serve/responses_utils.py (6)

ResponseHooks (894-919)

get_steady_clock_now_in_seconds (86-87)

on_req_begin (900-901)

on_ctx_resp (904-905)

on_first_token (908-912)

on_resp_done (915-919)

tensorrt_llm/serve/perf_metrics.py (4)

DisaggPerfMetricsCollector (63-154)

add_per_request_metrics (74-91)

add_client (71-72)

get_perf_metrics (93-148)

tensorrt_llm/serve/router.py (2)

Router (146-410)

create_router (650-685)

tensorrt_llm/serve/perf_metrics.py (1)

tensorrt_llm/serve/openai_client.py (2)

collect_metrics (58-58)

collect_metrics (223-231)

tensorrt_llm/serve/openai_disagg_service.py (8)

tensorrt_llm/llmapi/disagg_utils.py (3)

ConditionalDisaggConfig (42-43)

DisaggClusterConfig (59-64)

ServerRole (19-22)

tensorrt_llm/serve/cluster_storage.py (2)

ClusterStorage (60-106)

WatchEventType (26-28)

tensorrt_llm/serve/disagg_auto_scaling.py (9)

DisaggClusterManager (32-229)

WorkerInfo (17-21)

is_ready (224-225)

cluster_info (65-82)

start (58-59)

watch_workers (96-143)

stop (61-63)

worker_info (265-269)

worker_id (261-262)

tensorrt_llm/serve/openai_client.py (6)

OpenAIClient (28-70)

send_request (29-41)

shutdown (65-65)

shutdown (233-234)

check_ready (61-63)

check_ready (236-249)

tensorrt_llm/serve/openai_protocol.py (3)

ChatCompletionRequest (500-717)

CompletionRequest (210-341)

DisaggregatedParams (104-109)

tensorrt_llm/serve/openai_service.py (6)

OpenAIService (13-39)

openai_completion (15-23)

is_ready (33-33)

openai_chat_completion (26-30)

setup (36-36)

teardown (39-39)

tensorrt_llm/serve/responses_utils.py (3)

ResponseHooks (894-919)

done_generator (922-923)

on_req_begin (900-901)

tensorrt_llm/serve/router.py (5)

KvCacheAwareRouter (541-647)

Router (146-410)

start_server_monitoring (204-216)

stop_server_monitoring (218-231)

remove_server (183-194)

tensorrt_llm/serve/openai_client.py (5)

tensorrt_llm/serve/openai_protocol.py (4)

ChatCompletionRequest (500-717)

ChatCompletionResponse (441-450)

CompletionRequest (210-341)

CompletionResponse (146-155)

tensorrt_llm/serve/perf_metrics.py (6)

ClientMetricsCollector (42-60)

DisaggPerfMetricsCollector (63-154)

inc (56-57)

inc (150-151)

observe (59-60)

observe (153-154)

tensorrt_llm/serve/responses_utils.py (5)

ResponseHooks (894-919)

get_steady_clock_now_in_seconds (86-87)

on_ctx_resp (904-905)

on_first_token (908-912)

on_resp_done (915-919)

tensorrt_llm/serve/router.py (1)

Router (146-410)

tensorrt_llm/serve/openai_disagg_server.py (3)

on_ctx_resp (50-51)

on_first_token (53-55)

on_resp_done (57-59)

tensorrt_llm/serve/disagg_auto_scaling.py (2)

tensorrt_llm/serve/cluster_storage.py (10)

WatchEventType (26-28)

watch (94-95)

watch (239-248)

watch (388-390)

watch (533-546)

drain (43-51)

unwatch (98-99)

unwatch (250-257)

unwatch (392-394)

unwatch (548-551)

tensorrt_llm/logger.py (2)

error (126-127)

warning (132-133)

🪛 GitHub Actions: Release Checks

tensorrt_llm/serve/openai_service.py

[error] 18-21: D205 1 blank line required between summary line and description

tensorrt_llm/serve/openai_disagg_service.py

[error] 89-93: D205 1 blank line required between summary line and description

[error] 89-89: D415 First line should end with a period, question mark, or exclamation point

[error] 91-91: E501 Line too long (122 > 120)

tensorrt_llm/serve/openai_client.py

[error] 52-52: D205 1 blank line required between summary line and description

[error] 62-62: D415 First line should end with a period, question mark, or exclamation point

[error] 111-111: F821 Undefined name anext

🪛 Ruff (0.14.2)

tensorrt_llm/serve/openai_disagg_server.py

17-17: Redefinition of unused CppExecutorError from line 16

(F811)

47-47: Unused method argument: request

(ARG002)

50-50: Unused method argument: response

(ARG002)

53-53: Unused method argument: request

(ARG002)

53-53: Unused method argument: response

(ARG002)

57-57: Unused method argument: response

(ARG002)

59-59: Store a reference to the return value of asyncio.create_task

(RUF006)

93-93: Do not catch blind exception: Exception

(BLE001)

98-98: Unused function argument: app

(ARG001)

141-141: Do not catch blind exception: Exception

(BLE001)

154-154: Use explicit conversion flag

Replace with conversion flag

(RUF010)

tensorrt_llm/serve/openai_disagg_service.py

40-40: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

65-65: Avoid specifying long messages outside the exception class

(TRY003)

73-75: Avoid specifying long messages outside the exception class

(TRY003)

83-83: Avoid specifying long messages outside the exception class

(TRY003)

249-249: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

249-249: Avoid specifying long messages outside the exception class

(TRY003)

264-266: Avoid specifying long messages outside the exception class

(TRY003)

268-268: Avoid specifying long messages outside the exception class

(TRY003)

270-270: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/serve/openai_client.py

41-41: Prefer TypeError exception for invalid type

(TRY004)

41-41: Avoid specifying long messages outside the exception class

(TRY003)

65-65: OpenAIClient.shutdown is an empty method in an abstract base class, but has no abstract decorator

(B027)

80-80: Unused method argument: perf_metrics_collector

(ARG002)

139-141: Abstract raise to an inner function

(TRY301)

139-141: Avoid specifying long messages outside the exception class

(TRY003)

229-230: try-except-continue detected, consider logging the exception

(S112)

229-229: Do not catch blind exception: Exception

(BLE001)

241-241: Do not catch blind exception: Exception

(BLE001)

247-247: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

248-248: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/serve/disagg_auto_scaling.py

136-136: Do not catch blind exception: Exception

(BLE001)

tensorrt_llm/serve/disagg_auto_scaling.py

tensorrt_llm/serve/openai_client.py

tensorrt_llm/serve/openai_disagg_server.py

tensorrt_llm/serve/openai_service.py

tensorrt_llm/serve/perf_metrics.py

reasonsolo · 2025-11-04T07:58:32Z

/bot run

tensorrt-cicd · 2025-11-04T08:04:58Z

PR_Github #23487 [ run ] triggered by Bot. Commit: 50348fd

tensorrt-cicd · 2025-11-04T08:51:40Z

PR_Github #23487 [ run ] completed with state FAILURE. Commit: 50348fd
/LLM/main/L0_MergeRequest_PR pipeline #17680 completed with status: 'FAILURE'

reasonsolo · 2025-11-04T09:28:05Z

/bot run

tensorrt-cicd · 2025-11-04T09:34:24Z

PR_Github #23496 [ run ] triggered by Bot. Commit: 57451e6

tensorrt-cicd · 2025-11-04T11:19:01Z

PR_Github #23496 [ run ] completed with state SUCCESS. Commit: 57451e6
/LLM/main/L0_MergeRequest_PR pipeline #17687 completed with status: 'FAILURE'

reasonsolo · 2025-11-05T03:28:06Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-05T03:33:57Z

PR_Github #23576 [ run ] triggered by Bot. Commit: b419e60

tensorrt-cicd · 2025-11-05T07:44:44Z

PR_Github #23576 [ run ] completed with state SUCCESS. Commit: b419e60
/LLM/main/L0_MergeRequest_PR pipeline #17741 completed with status: 'FAILURE'

reasonsolo · 2025-11-05T10:27:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-05T10:32:50Z

PR_Github #23640 [ run ] triggered by Bot. Commit: ade914e

tensorrt-cicd · 2025-11-11T09:11:49Z

PR_Github #24129 [ run ] completed with state SUCCESS. Commit: 192a36e
/LLM/main/L0_MergeRequest_PR pipeline #18190 completed with status: 'FAILURE'

reasonsolo · 2025-11-11T09:39:50Z

/bot run

tensorrt-cicd · 2025-11-11T09:45:29Z

PR_Github #24161 [ run ] triggered by Bot. Commit: 192a36e

tensorrt-cicd · 2025-11-11T13:55:31Z

PR_Github #24161 [ run ] completed with state SUCCESS. Commit: 192a36e
/LLM/main/L0_MergeRequest_PR pipeline #18217 completed with status: 'FAILURE'

reasonsolo · 2025-11-11T14:34:57Z

/bot run

tensorrt-cicd · 2025-11-11T14:40:54Z

PR_Github #24193 [ run ] triggered by Bot. Commit: cf7deb1

tensorrt-cicd · 2025-11-12T05:03:14Z

PR_Github #24193 [ run ] completed with state SUCCESS. Commit: cf7deb1
/LLM/main/L0_MergeRequest_PR pipeline #18243 completed with status: 'FAILURE'

reasonsolo · 2025-11-12T07:06:46Z

/bot run

tensorrt-cicd · 2025-11-12T07:12:49Z

PR_Github #24264 [ run ] triggered by Bot. Commit: cf7deb1

tensorrt-cicd · 2025-11-12T20:17:03Z

PR_Github #24264 [ run ] completed with state SUCCESS. Commit: cf7deb1
/LLM/main/L0_MergeRequest_PR pipeline #18305 completed with status: 'FAILURE'

Signed-off-by: Lizhi Zhou <[email protected]>

reasonsolo · 2025-11-13T00:43:39Z

/bot run

tensorrt-cicd · 2025-11-13T00:49:57Z

PR_Github #24349 [ run ] triggered by Bot. Commit: 4541c5a

tensorrt-cicd · 2025-11-13T02:48:05Z

PR_Github #24349 [ run ] completed with state SUCCESS. Commit: 4541c5a
/LLM/main/L0_MergeRequest_PR pipeline #18377 completed with status: 'FAILURE'

reasonsolo · 2025-11-13T04:35:19Z

/bot run

tensorrt-cicd · 2025-11-13T04:42:23Z

PR_Github #24391 [ run ] triggered by Bot. Commit: 4541c5a

tensorrt-cicd · 2025-11-13T11:42:54Z

PR_Github #24391 [ run ] completed with state SUCCESS. Commit: 4541c5a
/LLM/main/L0_MergeRequest_PR pipeline #18405 completed with status: 'FAILURE'

reasonsolo · 2025-11-14T00:57:52Z

/bot run

reasonsolo · 2025-11-14T01:00:29Z

Hold this up to avoid introducing more changes of disagg-serving which caused a lot CI failures recently.

tensorrt-cicd · 2025-11-14T01:05:39Z

PR_Github #24524 [ run ] triggered by Bot. Commit: 4541c5a

tensorrt-cicd · 2025-11-14T06:00:25Z

PR_Github #24524 [ run ] completed with state SUCCESS. Commit: 4541c5a
/LLM/main/L0_MergeRequest_PR pipeline #18510 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch 3 times, most recently from 104984f to 577e9db Compare November 3, 2025 10:41

reasonsolo marked this pull request as ready for review November 3, 2025 10:44

reasonsolo requested review from a team as code owners November 3, 2025 10:44

reasonsolo requested review from Tabrizian, hchings and schetlur-nv November 3, 2025 10:44

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 577e9db to 8ea9d89 Compare November 3, 2025 10:44

coderabbitai bot reviewed Nov 3, 2025

View reviewed changes

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 8ea9d89 to 50348fd Compare November 4, 2025 07:56

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 50348fd to 57451e6 Compare November 4, 2025 09:27

Shixiaowei02 requested review from pcastonguay and zhengd-nv November 5, 2025 08:30

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from b419e60 to ade914e Compare November 5, 2025 10:26

reasonsolo enabled auto-merge (squash) November 11, 2025 10:10

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 192a36e to cf7deb1 Compare November 11, 2025 14:34

reasonsolo added 6 commits November 13, 2025 08:43

implement prototype

c24b478

Signed-off-by: Lizhi Zhou <[email protected]>

fix tests

e0d87c9

Signed-off-by: Lizhi Zhou <[email protected]>

fix failed CI tests

ffe243b

Signed-off-by: Lizhi Zhou <[email protected]>

fix failed tests

85709f8

Signed-off-by: Lizhi Zhou <[email protected]>

address review comments

bffa734

Signed-off-by: Lizhi Zhou <[email protected]>

add a UT for OpenAIHttpClient

4541c5a

Signed-off-by: Lizhi Zhou <[email protected]>

reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from cf7deb1 to 4541c5a Compare November 13, 2025 00:43

reasonsolo disabled auto-merge November 14, 2025 00:58

[TRTLLM-8920][feat] decouple disagg service from fastapi #8714

Are you sure you want to change the base?

[TRTLLM-8920][feat] decouple disagg service from fastapi #8714

Conversation

reasonsolo commented Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

kill

skip

reuse-pipeline

Uh oh!

coderabbitai bot commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Nov 3, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

reasonsolo commented Nov 4, 2025

Uh oh!

tensorrt-cicd commented Nov 4, 2025

Uh oh!

tensorrt-cicd commented Nov 4, 2025

Uh oh!

reasonsolo commented Nov 4, 2025

Uh oh!

tensorrt-cicd commented Nov 4, 2025

Uh oh!

tensorrt-cicd commented Nov 4, 2025

Uh oh!

reasonsolo commented Nov 5, 2025

Uh oh!

tensorrt-cicd commented Nov 5, 2025

Uh oh!

tensorrt-cicd commented Nov 5, 2025

Uh oh!

reasonsolo commented Nov 5, 2025

Uh oh!

tensorrt-cicd commented Nov 5, 2025

Uh oh!

tensorrt-cicd commented Nov 11, 2025

Uh oh!

reasonsolo commented Nov 11, 2025

reasonsolo commented Oct 28, 2025 •

edited

Loading

coderabbitai bot commented Nov 3, 2025 •

edited

Loading